---
title: "Netflix Dataset Analysis"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    storyboard: true
    social: menu
    source_code: embed
---

```{r setup, include=FALSE}
library(stringr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(forcats)
library(ggthemes)
library(tidyverse)
library(wordcloud)
library(tokenizers)
library(countrycode)
library(tm)
library(proxy)
library(igraph)
library(RColorBrewer)
library(plotly)

# Read the data
df <- read.csv('/Users/huxin/Desktop/netflix_titles.csv')

# Data preprocessing
df['year'] <- str_sub(df$date_added,-4,-1)  # last four characters of date_added are the year

# Split the duration column into 2 columns:
# season_count for TV shows ("2 Seasons") and duration in minutes for movies ("90 min")
is_show <- grepl("Season", df$duration)
leading_number <- str_extract(as.character(df$duration), "^\\d+")  # keeps multi-digit counts intact
df['season_count'] <- ifelse(is_show, leading_number, NA)
df['duration'] <- ifelse(is_show, NA, leading_number)
```


EDA
=======================================================================

Inputs {.sidebar}
-----------------------------------------------------------------------
Chart 1 (Top Left): Among the most frequent actors, International Movies is the most 
common genre. Dramas and Comedies are also very popular.

Chart 2 (Top Right): Comparing durations across the 5 countries with the most movies and 
TV shows, the distributions for Canada, Japan and the United Kingdom are very similar, while 
India generally has longer durations than the other countries. The US also has some outliers 
with longer durations.

Chart 3 (Bottom Left): This simple pie chart shows the share of Netflix content by type. 
Movies account for a larger proportion than TV shows.

Chart 4 (Bottom Right): In the distribution of ratings, 'TV-MA' has the largest count of titles. 'TV-MA' is a rating assigned by the TV Parental Guidelines to television programs designed for mature audiences only. The second largest is 'TV-14', which stands for content that may be inappropriate for children younger than 14 years of age. The third largest is the very popular 'R' rating. An R-rated film has been assessed by the Motion Picture Association of America as having material which may be unsuitable for children under the age of 17.

Row{data-height=200}
-------------------------------------
    
### Genre distribution for 6 most frequent actors
    
```{r,fig.width=7}
# What kind of genres do top actors (by frequency) belong to?
df_cast <- df %>% 
    mutate(cast = strsplit(as.character(cast), ",")) %>% 
    unnest(cast) %>%
    mutate(cast = trimws(cast, which = c("left"))) %>%  #eliminate space on the left side
    group_by(cast) %>%
    add_tally() %>%
    select(cast,n,listed_in) %>%
    unique()
  
df_actor_top <- df_cast[order(-df_cast$n),]

#count the genres 
df_actor_top_genre <- df_actor_top %>%
  select(cast, listed_in) %>%
  mutate(listed_in = strsplit(as.character(listed_in), ",")) %>% 
  unnest(listed_in) %>%
  mutate(listed_in = trimws(listed_in, which = c("left"))) %>% #eliminate space on the left side
  group_by(cast,listed_in) %>%
  add_tally() %>%
  unique()

df_actor_top_only <- df_actor_top[,1:2] %>% 
                     unique()
df_actor_top_only <- df_actor_top_only[1:30,]
df_actor_top_only <- df_actor_top_only[order(-df_actor_top_only$n),][1:6,]
  
df_actor_top_5_genre <- df_actor_top_genre[df_actor_top_genre$cast %in% df_actor_top_only$cast,]
  


mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(23)

ggplot(data = df_actor_top_5_genre, aes(x = "", y = n, fill = listed_in )) + 
  facet_wrap(~ cast)  +
  geom_bar(stat = "identity",position = position_fill()) +
  coord_polar(theta = "y") +
  theme_tufte() +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank())+
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid  = element_blank())+
  theme(legend.title = element_blank())+
  theme(plot.title = element_text(size = 15)) +
  scale_fill_manual(values = mycolors)
```
   
   
### Distribution of Duration for Countries with the Most Contents

```{r,fig.width=7}
# Duration distribution for five top countries
df <- df[(!df$country == ""), ]

top_country <- df %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(5) 

df_top <- df[df$country %in% top_country$country,]
df_top <- df_top[!is.na(df_top$duration), ]
df_top$duration <- as.integer(df_top$duration)


ggplot(df_top, aes(x = country, y = duration, fill = country)) + 
  geom_violin() + 
  geom_boxplot(width = 0.1) +
  labs(x = "", y = "Duration(minutes)") +  
  theme_tufte() +
  theme(plot.title = element_text(size = 16),
       axis.text.x = element_text(size = 10),
       axis.text.y = element_text(size = 10),
       axis.title = element_text(size = 13))
```

   
   
Row{data-height=200}
-------------------------------------
    
### Amount Of Netflix Content By Type
    
```{r,fig.width=7}
amount_by_type <- df %>% group_by(type) %>% summarise(
  count = n()
)

fig1 <- plot_ly(amount_by_type, labels = ~type, values = ~count, type = 'pie', marker = list(colors = c("#bd3939", "#399ba3")))
fig1 <- fig1 %>% layout(
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

fig1
```
    
### Movie Rating Analysis

```{r,fig.width=7}
# Bar plot for distribution of ratings
df <- df[(!df$rating == ""), ]

colourCount <- length(unique(df$rating))
# brewer.pal() needs n >= 3; asking for 2 only triggers a warning and returns 3 colours anyway
getPalette <- colorRampPalette(brewer.pal(3, "Set1"))

ggplot(df, aes(x = fct_infreq(factor(rating)),fill = factor(rating))) + 
  geom_bar(stat = "count", width = .5) + 
  scale_fill_manual(values = getPalette(colourCount)) +
  labs(x = 'rating', subtitle = "Count vs. Rating") + 
  theme_tufte() +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  theme(legend.position = "none")
```


Text Visualization{.storyboard data-navmenu="More Visualizations"}
=======================================================================

### Most frequent word used in movie titles
    
```{r}
# Most frequent word used in movie titles

tot_title <- paste(df[,3],collapse = " ")  #All titles into one 
tot_title_words <-  tokenize_words(tot_title)  #Tokenize sentence to words
words.freq <- table(unlist(tot_title_words))

result <- cbind.data.frame(words = names(words.freq) ,amount = as.integer(words.freq))
result_dec <- result[order(-result$amount),]

result_dec_filter <- result_dec %>%
  filter(nchar( as.character(words)) > 3)  #Keep only words longer than three characters

# Use the full column name (words) rather than relying on partial matching of $word
wordcloud(words = result_dec_filter$words, freq = result_dec_filter$amount,
          min.freq = 1, max.words = 150, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8,"Set2"))
```


***
Based on the word cloud on the left, "love", "story", "christmas" and "world" are the most frequent 
words in titles. This suggests that Christmas movies make up a sizable share of the catalog and that 
romantic movies are a major type in the movie market. Beyond these two types, words like "rangers", 
"super", "world" and "power" suggest that hero movies, such as those in the Marvel Cinematic Universe, 
are also well represented.
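
The counting behind the word cloud can be sketched on a handful of titles. This is a minimal base R illustration, not the dashboard's own pipeline (which uses `tokenize_words()`); the `toy_titles` vector is made up for the example.

```r
# Hypothetical titles, for illustration only
toy_titles <- c("A Christmas Story", "Love Story", "World of Love")

# Lowercase, then split on anything that is not a letter
words <- unlist(strsplit(tolower(toy_titles), "[^a-z]+"))

# Keep only words longer than three characters, as in the chunk for this frame
words <- words[nchar(words) > 3]

# Count occurrences and sort from most to least frequent
word_freq <- sort(table(words), decreasing = TRUE)
word_freq
```

Short words such as "a" and "of" are dropped by the length filter, which is why they never appear in the cloud.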
    
    
### This bar plot shows the distribution of title lengths for movies and shows. The average length is about 3 words, and the longest title has 17 words

```{r}
# How long are the titles movie/shows? And which one is the longest?

library(stringr)

title_length <- sapply(strsplit(as.character(df[,3]), " "), length)


title_length <- cbind.data.frame(num = title_length, titles = df[,3] )

mean_title <- mean(title_length$num)
max_title <- max(title_length$num)

Longest_title <- title_length %>% 
filter(num == max_title)


a  <- ggplot(title_length, aes(num)) +
  geom_bar(fill = "rosybrown3") +
  theme_bw() +
  xlab("Title length ( # words)") + ylab("Frequency") +
  theme(legend.position = "none") +
  labs(title = "Length distribution of shows titles")+
  geom_segment(aes(x = 3, y = 0, xend = 3, yend = 1500), linetype = "dashed") +
  ggplot2::annotate("text", x = 5, y = 1500, label = "Average",color = "black", size = 4)

a
```


Time Series{.storyboard data-navmenu="More Visualizations"}
=======================================================================

### By Type (Movies/TV)

```{r}
# The number of Movie and TV shows over years

df['date_added'] <-  as.Date(df$date_added,format = "%B %d, %Y")
Time_Period <- df %>%
  group_by(type,date_added) %>%
  summarise(count = n()) %>%
  mutate(total_shows = cumsum(count))  #Get the cumulative sum of movie and TV shows over years

library(plotly)
fig4 <- plot_ly(Time_Period, x = ~date_added, y = ~total_shows, color = ~type, type = 'scatter', mode = 'lines', colors = c("#bd3939",  "#9addbd", "#399ba3")) 

fig4 <- fig4 %>% layout(yaxis = list(title = 'Count'), xaxis = list(title = 'Date'), title = "Amount Of Content As A Function Of Time")

fig4
```

***
The number of movies on Netflix has grown much faster than the number of TV shows. About 1300 new movies were added in each of 2018 and 2019. The growth in content started in 2013. Netflix kept adding movies and TV shows to its platform over the years, and this content was varied: it came from different countries and was released across many years.
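
The running totals plotted in this frame are cumulative sums computed within each type. A minimal base R sketch of that idea follows; the `toy` data frame and its counts are invented for illustration, and `ave()` stands in for the `group_by()` plus `cumsum()` combination used in the chunk.

```r
# Hypothetical per-date counts, for illustration only
toy <- data.frame(
  type       = c("Movie", "Movie", "Movie", "TV Show", "TV Show"),
  date_added = as.Date(c("2018-01-01", "2018-06-01", "2019-01-01",
                         "2018-03-01", "2019-02-01")),
  count      = c(2, 3, 1, 1, 4)
)

# Sort by type and date so each running total accumulates in time order
toy <- toy[order(toy$type, toy$date_added), ]

# Running total within each type
toy$total_shows <- ave(toy$count, toy$type, FUN = cumsum)
toy
```

Sorting by date first matters: a cumulative sum over unsorted dates would still end at the right total, but the intermediate points on the line would be wrong.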


### By Movie/TV Category{data-commentary-width=200}

```{r}
df_category <- df %>%
  mutate(listed_in = strsplit(as.character(listed_in), ",")) %>% 
  unnest(listed_in) %>%
  mutate(listed_in = trimws(listed_in, which = c("left"))) %>% #eliminate space on the left side
  group_by(listed_in,date_added) %>%
  summarise(count = n()) %>%
  mutate(total_shows = cumsum(count)) 
  
fig5 <- plot_ly(df_category, x = ~date_added, y = ~total_shows, color = ~listed_in, type = 'scatter', mode = 'lines') 

fig5 <- fig5 %>% layout(yaxis = list(title = 'Count'), xaxis = list(title = 'Date'), title = "Amount Of Content As A Function Of Time (By Category)")

fig5
```


***
International Movies, Dramas and Comedies have the top 3 growth rates. Beyond these three, International TV Shows, Independent Movies and Action & Adventure also grew very quickly compared to the rest of the categories.



Geographical{data-navmenu="More Visualizations"}
=======================================================================

Inputs {.sidebar}
-----------------------------------------------------------------------
The map compares the amount of content by country. The United States, Canada, India, the United Kingdom and Japan have the most content. Outside the top 5 countries, China, France, Australia and Spain also have competitive numbers of movies and TV shows.

Column
-----------------------------------------------------------------------
### Map

```{r}
# Map
df_country <- df %>% 
    mutate(country = strsplit(as.character(country), ",")) %>% 
    unnest(country) %>%
    mutate(country = trimws(country, which = c("left"))) %>%  #eliminate space on the left side
    group_by(country) %>%
    add_tally() %>%
    select(country,n) %>%
    unique()

df_country['Code'] <- countrycode(df_country$country,"country.name", "iso3c")
df_country <- na.omit(df_country)

map <- plot_ly(df_country , type = 'choropleth', locations = df_country$Code, z = df_country$n, text = df_country$country, colorscale = 'Inferno')

map <- map %>% layout(
    title = 'Comparison of Contents by Country
(Cumulative from 2008 to 2020)')

map <- map %>% colorbar(title = 'Number of Contents')

map
```


Network{.storyboard data-navmenu="More Visualizations"}
=======================================================================

Inputs {.sidebar}
-----------------------------------------------------------------------
To find movies similar to "The Irishman", we first built a document-term matrix from the movie descriptions. Then, by computing the cosine similarity, we found the three movies with the top 3 similarity scores. The network shows the relationship between these four movies. The nodes include cast, country, director, listed (category), movie name, rating, release year, type (movie/TV show) and cluster. In this network, some cast members are shared between these movies, and two of the movies share the same director. A few of them also share the same type. There are indeed some common characteristics between "The Irishman" and the three similar movies.
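
The cosine-similarity step can be illustrated on a tiny matrix. This is a minimal sketch under stated assumptions: `toy_mat` is a made-up 3-document weight matrix and `cosine_similarity()` is an illustrative helper, not the function used in the dashboard chunks.

```r
# Hypothetical 3-document, 3-term weight matrix, for illustration only
toy_mat <- rbind(
  doc1 = c(1, 1, 0),
  doc2 = c(2, 2, 0),
  doc3 = c(0, 0, 3)
)

# Cosine similarity: pairwise dot products divided by the product of row norms
cosine_similarity <- function(m) {
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

sim_toy <- cosine_similarity(toy_mat)
diag(sim_toy) <- NA   # ignore each document's similarity with itself

# Name of the document most similar to doc1
names(which.max(sim_toy["doc1", ]))
```

Because cosine similarity depends only on direction, `doc2` (a scaled copy of `doc1`) scores 1 against `doc1`, while `doc3`, which shares no terms with it, scores 0.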

Network
-----------------------------------------------------------------------

```{r}
df_US <- df %>% filter(country == 'United States', release_year > 2015, type == 'Movie')

# Make a corpus from the column containing the document text
source <- VectorSource(df_US$description)
corpus <- Corpus(source)

# Take the standard steps to clean and prepare the data
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))

# Create a document-term matrix:
mat <- DocumentTermMatrix(corpus)
mat4 <- weightTfIdf(mat)
mat4 <- as.matrix(mat4)

# Normalize the TF-IDF scores of each document to unit Euclidean length
norm_eucl <- function(m) m/apply(m,1,function(x) sum(x^2)^.5)
mat_norm <- norm_eucl(mat4)

set.seed(5)
k <- 10
kmeansResult <- kmeans(mat_norm, k)

# Cbind the column of cluster
df_cluster <- cbind(df_US, cluster = kmeansResult$cluster)
```

```{r}
# Find the cosine similarity
cos_sim = function(matrix){
  numerator = matrix %*% t(matrix)
  A = sqrt(apply(matrix^2, 1, sum))
  denumerator = A %*% t(A)
  return(numerator / denumerator)
}

dfSim <- cos_sim(mat4)

# Remove the diagonal (each movie's similarity with itself)
diag(dfSim) <- NA

# Find the three most similar movies for one movie
find_similar = function(movie){
  similar = df_cluster[df_cluster$title == movie,]
  index = which(df_cluster$title == movie, arr.ind = TRUE)
  top_3 = as.data.frame(sort(dfSim[index,], decreasing = TRUE)[1:3])
  for (i in 1:3) {
    index = as.integer(rownames(top_3)[i])
    similar = rbind(similar,df_cluster[index,])
  }
  return(similar)
}
```

```{r}
sim <- find_similar('The Irishman')

# Drop irrelevant columns
drops <- c("show_id","date_added","duration","description","season_count","year")
sim <- sim[ , !(names(sim) %in% drops)]

# Replace empty strings with NA
sim <- sim %>% mutate_all(na_if,"")
sim[is.na(sim)] <- "Not shown"

# Add Cast
sim_cast <- sim %>% select(title,cast)
edge_list <- sim_cast %>%
  mutate(cast = strsplit(cast, ",")) %>%
  unnest(cast) %>%
  mutate(cast = trimws(cast, which = c("left")))
edge_list <- as.data.frame(edge_list)
colnames(edge_list) <- c("source", "target")
edge_list['attribute'] <- "cast"

# Add Listed_in
sim_list <- sim %>% select(title,listed_in)
edge_list_2 <- sim_list %>%
  mutate(listed_in = strsplit(listed_in, ",")) %>%
  unnest(listed_in) %>%
  mutate(listed_in = trimws(listed_in, which = c("left")))
colnames(edge_list_2) <- c("source", "target")
edge_list_2['attribute'] <- "listed"

edge_list <- rbind(edge_list, edge_list_2)
```

```{r,fig.width=9}
column_names = colnames(sim[,-c(2,4,8)])

# Create empty dataframe
result = data.frame(matrix(ncol = 2, nrow = 0))
x <- c("source", "target")
colnames(result) <- x

for (i in column_names) {
  # Add nodes
  sim_new <- sim %>% select(title,i)
  edge_list_new <- sim_new %>%
    mutate(i = strsplit(i, ",")) %>%
    unnest(i) %>%
    mutate(i = trimws(i, which = c("left")))
  # Change Name
  colnames(edge_list_new) <- c("source", "target")
  # Rbind
  result = rbind(result,edge_list_new)
  result = as.data.frame(result)
}

colnames(result) <- c("source", "target","attribute")
edge_list <- rbind(edge_list,result)
write.csv(edge_list,'edge_list.csv')

# Create attribute list
attributes <- read.csv('/Users/huxin/Desktop/attributes.csv')

# Network
net <- graph_from_data_frame(edge_list)
net <- set_vertex_attr(net, "type", index = V(net), as.character(attributes$attribute))

colrs <- brewer.pal(9, "Paired")
my_color <- colrs[as.numeric(as.factor(V(net)$type))]

par(mar = c(0,0,1,0) + .1)
plot(net,vertex.size = 13, vertex.frame.color = "gray",vertex.label.color = "black", vertex.label.cex = 0.4, vertex.label.dist = 0, vertex.size = 43, edge.arrow.size = 0.2, vertex.color = my_color)
legend("bottomleft", legend = levels(as.factor(V(net)$type)), col = colrs , bty = "n", pch = 20 , pt.cex = 2.5, cex = 0.8, text.col = colrs , horiz = FALSE, inset = c(0.05, 0.05))
title("Similar movies for The Irishman",cex.main = 1.1)
```

About
=======================================================================

This dataset consists of TV shows and movies available on [Netflix as of 2019](https://www.kaggle.com/shivamb/netflix-shows). The dataset was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles. The objective of this visual analysis is to explore what other insights can be obtained from the data set, including but not limited to questions like:

  1. What is the duration of movies/TV shows from different countries?
  2. Which genres do the top actors (by frequency) appear in?
  3. What are the most frequent words used in movie titles?
  4. What is the growth rate of movies and TV shows over the years?

The application layout is produced with the [flexdashboard](https://rmarkdown.rstudio.com/flexdashboard/) package, and the charts, maps and network use [Plotly](https://plotly.com/), [ggplot2](https://plotly.com/ggplot2/), [igraph](https://igraph.org/r/), and [wordcloud](https://www.r-graph-gallery.com/wordcloud.html), all accessed through their corresponding R packages.